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Abstract: We consider Bayesian variable selection in sparse high-dimensional regres¬ 
sion, where the number of covariates p may be large relative to the sample size n , but 
at most a moderate number q of covariates are active. Specifically, we treat generalized 
linear models. For a single fixed sparse model with well-behaved prior distribution, clas¬ 
sical theory proves that the Laplace approximation to the marginal likelihood of the 
model is accurate for sufficiently large sample size n. We extend this theory by giving 
results on uniform accuracy of the Laplace approximation across all models in a high¬ 
dimensional scenario in which p and q , and thus also the number of considered models, 
may increase with n. Moreover, we show how this connection between marginal likeli¬ 
hood and Laplace approximation can be used to obtain consistency results for Bayesian 
approaches to variable selection in high-dimensional regression. 
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1. Introduction 

A key issue in Bayesian approaches to model selection is the evaluation of the marginal 
likelihood, also referred to as the evidence, of the different models that are being considered. 
While the marginal likelihood may sometimes be available in closed form when adopting 
suitable priors, most problems require approximation techniques. In particular, this is the 
case for variable selection in generalized linear models such as logistic regression, which are 
the models treated in this paper. Different strategies to approximate the marginal likelihood 
are reviewed by Friel and Wyse (2012). Our focus will be on the accuracy of the Laplace 
approximation that is derived from large-sample theory; see also Bishop (2006, Section 4.4). 

Suppose we have n independent observations of a response variable, and along with each 
observation we record a collection of p covariates. Write L(j3) for the likelihood function of 
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a generalized linear model relating the response to the covariates, where (3 £ is a vector 
of coefficients in the linear predictor (McCullagh and Nelder, 1989). Let /(/3) be a prior 
distribution, and let /? be the maximum likelihood estimator (MLE) of the parameter vector 
(3 £ R p . Then the evidence for the (saturated) regression model is the integral 


[ L(f3)f(/3) d(3, 

J S.P 


and the Laplace approximation is the estimate 



where H denotes the negative Hessian of the log-likelihood function log L. 

Classical asymptotic theory for large sample size n but fixed number of covariates p shows 
that the Laplace approximation is accurate with high probability (Haughton, 1988). With p 
fixed, this then clearly also holds for variable selection problems in which we would consider 
every one of the finitely many models given by the 2 P subsets of covariates. This accuracy 
result justifies the use of the Laplace approximation as a proxy for an actual model evidence. 
The Laplace approximation is also useful for proving frequentist consistency results about 
Bayesian methods for variable selection for a general class of priors. This is again discussed 
in Haughton (1988). The ideas go back to the work of Schwarz (1978) on the Bayesian 
information criterion (BIC). 

In this paper, we set out to give analogous results on the interplay between Laplace ap¬ 
proximation, model evidence, and frequentist consistency in variable selection for regression 
problems that are high-dimensional, possibly with p > n, and sparse in that we consider 
only models that involve small subsets of covariates. We denote q as an upper bound on the 
number of active covariates. In variable selection for sparse high-dimensional regression, the 
number of considered models is very large, on the order of p q . Our interest is then in bounds 
on the approximation error of Laplace approximations that, with high probability, hold uni¬ 
formly across all sparse models. Theorem 1, our main result, gives such uniform bounds (see 
Section 3). A numerical experiment supporting the theorem is described in Section 4. 

In Section 5, we show that when adopting suitable priors on the space of all sparse models, 
model selection by maximizing the product of model prior and Laplace approximation is 
consistent in an asymptotic scenario in which p and q may grow with n. As a corollary, we 
obtain a consistency result for fully Bayesian variable selection methods. We note that the 
class of priors on models we consider is the same as the one that has been used to define 
extensions of BIC that have consistency properties for high-dimensional variable selection 
problems (see, for example, Bogdan, Ghosh and Doerge, 2004; Chen and Chen, 2008; Zak- 
Szatkowska and Bogdan, 2011; Chen and Chen, 2012; Frommlet et al., 2012; Luo and Chen, 
2013; Luo, Xu and Chen, 2015; Barber and Drton, 2015). The prior has also been discussed 
by Scott and Berger (2010). 

2. Setup and assumptions 

In this section, we provide the setup for the studied problem and the assumptions needed 
for our results. 
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2.1. Problem setup 

We treat generalized linear models for n independent observations of a response, which we 
denote as Yi,.. ., Y n . Each observation Yj follows a distribution from a univariate exponential 
family with density 

Pe{y) oc exp {y ■ 0 - b(0)} , 9 £ R, 

where the density is defined with respect to some measure on K. Let Qi be the (natural) 
parameter indexing the distribution of Y), so Y) ~ pg r The vector 9 = ( 9\ ,..., 9 n ) T is then 
assumed to lie in the linear space spanned by the columns of a design matrix X = (Xj.,) £ 
R nxp , that is, 9 = X/3 for a parameter vector /3 £ R p . Our work is framed in a setting 
with a fixed/deterministic design X. In the language of McCullagh and Nelder (1989), our 
basic setup uses a canonical link, no dispersion parameter and an exponential family whose 
natural parameter space is the entire real line. This covers, for instance, logistic and Poisson 
regression. However, extensions beyond this setting are possible; see for instance the related 
work of Luo and Chen (2013) whose discussion of Bayesian information criteria encompasses 
other link functions. 

We write Xi for the ith row of X, that is, the p- vector of covariate values for observation Y). 
The regression model for the responses then has log-likelihood, score, and negative Hessian 
functions 


log L{p) = Y,Yi- XjP ~ b(Xf P) £ R , 

i= 1 
n 

s((3) = J2 X > - b'(Xf/3)) £ R p , 

i—1 

n 

H(0) =Y^,XiX7 ■ b"(XfP) £ R pxp . 


The results in this paper rely on conditions on the Hessian H , and we note that, implicitly, 
these are actually conditions on the design X. 

We are concerned with a sparsity scenario in which the joint distribution of Yi,..., Y n 
is determined by a true parameter vector /3 q £ R p supported on a (small) set Jo C \p] := 
{1,... ,p}, that is, ft 0 j ^ 0 if and only if j £ J 0 . Our interest is in the recovery of the set 
Jo when knowing an upper bound q on the cardinality of Jo, so |Jo| < q. To this end, we 
consider the different submodels given by the linear spaces spanned by subsets J C [p] of the 
columns of the design matrix X, where | J| < q. 

For notational convenience, we take J C [p\ to mean either an index set for the covariates 
or the resulting regression model. The regression coefficients in model J form a vector of 
length |J|. We index such vectors ft by the elements of J, that is, ft = (Bj : j £ J), and 
we write R J for the Euclidean space containing all these coefficient vectors. This way the 
coefficient and the covariate it belongs to always share a common index. In other words, the 
coefficient for the j- th coordinate of covariate vector Xj is denoted by Bj in any model J 
with j £ J. Furthermore, it is at times convenient to identify a vector /3 £ R J with the vector 
in R p that is obtained from /? by filling in zeros outside of the set J. As this is clear from the 
context, we simply write /3 again when referring to this sparse vector in R p . Finally, sj(/3) 
and Hj(fi) denote the subvector and submatrix of s(/3) and H(/3), respectively, obtained by 
extracting entries indexed by J. These depend only on the subvectors Xu = {X i j)j e j of the 
covariate vectors Xj. 
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2.2. Assumptions 

Recall that n is the sample size, p is the number of covariates, q is an upper bound on the 
model size, and /3 0 is the true parameter vector. We assume the following conditions to hold 
for all considered regression problems: 

(Al) The Euclidean norm of the true signal is bounded, that is, ||/3o || 2 < a 0 for a fixed 
constant ao £ (0, oo). 

(A2) There is a decreasing function ci OW er : [0, oo) —t (0,oo) and an increasing function 
Cupper : [0, oo) —> (0, oo) such that for all J C [p] with |J| < 2 q and all 0 £ R J , the 
Hessian of the negative log-likelihood function is bounded as 

Clower (11 112 ) IJ d =< C upper (||/3|| 2 )I 7 . 

(A3) There is a constant change £ (0, oo) such that for all J C [p] with |J| < 2 q and all 

^\\Hj(p) - Hj(p ')|| Bp < Cchange • ||/3 - 0'\\ 2 , 

where || • || sp is the spectral norm of a matrix. 

Assumption (A2) provides control of the spectrum of the Hessian of the negative log- 
likelihood function, and (A3) yields control of the change of the Hessian. Together, (A2) and 
(A3) imply that for all e > 0, there is a <5 > 0 such that 

(1 - e)Hj(p 0 ) A Hj{0j) d (1 + e)Hj{0 o ), (2-1) 

for all JO J 0 with \J\ < 2q and 0j £ with \\0j — /3oII 2 < S; see Prop. 2.1 in Barber and 
Drton (2015). Note also that we consider sets J with cardinality 2 q in (A2) and (A3) because 
it allows us to make arguments concerning false models, with J 2 Jo, using properties of the 
true model given by the union J U Jq. 

Remark 2.1. When treating generalized linear models, some control of the size of the true 
coefficient vector 0q is indeed needed. For instance, in logistic regression, if the norm of 
0o is too large, then the binary response will take on one of its values with overwhelming 
probability. Keeping with the setting of logistic regression, Barber and Drton (2015) show 
how assumptions (A2) and (A3) hold with high probability in certain settings in which the 
covariates are generated as i.i.d. sample. Assumptions (A2) and (A3), or the implication 
from (2.1), also appear in earlier work on Bayesian information criteria for high-dimensional 
problems such as Chen and Chen (2012) or Luo and Chen (2013). 

Let {fj : J C [p], | J\ < q} be a family of probability density functions fj : R J —> [0, 00 ) 
that we use to define prior distributions in all (/-sparse models. We say that the family is 
log-Lipscliitz with respect to radius R > 0 and has bounded log-density ratios if there exist 
two constants F-\, F 2 £ [0, 00 ) such that the following conditions hold for all J C [p] with 

\J\ < T- 

(Bl) The function log fj is Fd-Lipschitz on the ball B R ( 0) = {0 £ R J : ||/3|| 2 < i?}, i.e., for 
all 0',0 £ -Br(O), we have 

I log fj{0')~ log fj(0)\ <F40'-0\\ 2 . 

(B2) For all 0 £ R J , 

l°g/j(/3) - l°g /j(0) < F 2 . 
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Example 2.1. If we take fj to be the density of a |J|-fold product of a centered normal 
distribution with variance a 2 , then (Bl) holds with F\ = R/a 2 and Fi = 0. 

3. Laplace approximation 

This section provides our main result. For a high-dimensional regression problem, we show 
that a Laplace approximation to the marginal likelihood of each sparse model, 

Evidence(J) := [ L(P)fj(P)dp , 

leads to an approximation error that, with high probability, is bounded uniformly across all 
models. To state our result, we adopt the notation 

a = 6(1 ± c) :<;=>■ a € [6(1 — c), 6(1 + c)]. 

Theorem 1. Suppose conditions (Al)-(A3) hold. Then, there are constants v, c samp i e , &mle £ 
(0, oo) depending only on (a 0) ci OW er, c upP er, c c han g e) such that if 

n > c samp ie • q 3 max{log(p), log 3 (n)}, 

then with probability at least 1 — p~ v the following two statements are true for all sparse 
models J C [p], |J| < q: 

(i) The MLE [3j satisfies ||/3j|| 2 < cimle- 

(ii) If additionally the family of prior densities {fj : J C [p\, \ J\ < q} satisfies the Lipschitz 
condition from (Bl) for radius R > cimle +1, and has log-density ratios bounded as in (B2), 
then there is a constant c Lap iace £ (0,oo) depending only on (a 0 , ci OW er, c upper , change, F 1 , F 2 ) 
such that 

Evidence (J) = L0j)fj0j) • ^ ± c Lap u ce ^^^ • 

Proof, (i) Bounded MLEs. It follows from Barber and Drton (2015, Sect. B.2) 1 that, with 
the claimed probability, the norms \\Pj \\2 for true models J (i.e., J D J 0 and | J\ < 2 q) are 
bounded by a constant. The result makes reference to an event for which all the claims we 
make subsequently are true. The bound on the norm of an MLE of a true model was obtained 
by comparing the maximal likelihood to the likelihood at the true parameter /3q. As we show 
now, for false sparse models, we may argue similarly but comparing to the likelihood at 0. 

Recall that a 0 is the bound on the norm of /3 0 assumed in (Al) and that the functions Ci ower 
and c upper in (A2) are decreasing and increasing in the norm of /3q, respectively. Throughout 
this part, we use the abbreviations 


Slower • QowerV^Oj? ^upper • ^upper 

1 In the proof of this theorem, we cite several results from Barber and Drton (2015, Sect. B.2, Lem. B.l). 

Although that paper treats the specific case of logistic regression, by examining the proofs of their results 

that we cite here, we can see that they hold more broadly for the general GLM case as long as we assume 
that the Hessian conditions hold, i.e., Conditions (Al)—(A3), and therefore we may use these results for the 
setting considered here. 
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First, we lower-bound the likelihood at 0 via a Taylor-expansion using the true model Jq. 
For some t £ [0,1], we have that 

logL(O) - log L(P 0 ) = ~/ 8 o s J 0 (Po) - \Po Hj 0 (t(3o)Po > ~Po s Jo {P 0 ) - ^nao c u PP er, 
where we have applied (A2). Lemma B.l in Barber and Drton (2015) yields that 

\0O s Jo(M\ < WHjM^sMEHMboW < Toaoy/nCnpper, 

where Tq can be bounded by a constant multiple of qlog(p). By our sample size assumption 
(i.e., the existence of the constant c samp i e ), we thus have that 

logL(O) — logL(^o) > -n-c i (3.1) 

for some constant C\ £ ( 0 , oo). 

Second, we may consider the true model J U Jo instead of J and apply (B.17) in Barber 
and Drton (2015) to obtain the bound 

log L0j) - log L(p 0 ) < 

\\Pj~Po\\ ■ ^V^Cupper • Tj\ Jo - min 1 11 p J - do 11, 2 ^° wer | j , (3-2) 

where Tj\j 0 can be bounded by a constant multiple of glog(p). Choosing our sample size 
constant c samp i e large enough, we may deduce from (3.2) that there is a constant C 2 £ (0, oo) 
such that 

\ogL0j) - logL(/? 0 ) < -n\\pj - Po\\c 2 

whenever \\$j — /?o|| > ci OW er/(2c c hange)- Using the fact that logL(0) < log(/3 j) for any model 
J, we may deduce from (3.2) that there is a constant C 2 £ (0,oo) such that 

logL( 0 ) - log L(/3o) < -n\\Pj - Po\\c 2 

whenever \\j3j - /3 0 || > Ci 0 wer/(2c c hange)- Together with (3.1), this implies that ||/3 j - /3 0 || is 
bounded by a constant C 3 . Having assumed (Al), we may conclude that the norm of fij is 
bounded by <20 + 03 . 

(ii) Laplace approximation. Fix J C [p\ with \J\ < q. In order to analyze the evidence of 
model J, we split the integration domain K * 7 into two regions, namely, a neighborhood J\f of 
the MLE j3j and the complement R J \A f. More precisely, we choose the neighborhood of the 
MLE as 

:= { P £ R J : \\HjCPj) 1/2 {P - Pj )|| 2 < \/5|J|log(n) } . 

Then the marginal likelihood, Evidence( J), is the sum of the two integrals 

h = f L(P)fj(p)dp , 

JM 

1 2 = [ L(P)fj(P)dP- 

J b+aa 
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We will estimate I\ via a quadratic approximation to the log-likelihood function. Outside of 
the region M , the quadratic approximation may no longer be accurate but due to concavity of 
the log-likelihood function, the integrand can be bounded by for an appropriately 

chosen constant c, which allows us to show that I 2 is negligible when n is sufficiently large. 

We now approximate and I 2 separately. Throughout this part we assume that we have 
a bound omle on the norms of the MLEs /3j in sparse models J with | J| < q. For notational 
convenience, we now let 


Clower : — C'lower (c \l LE ) , ^upper •— Cupper (omle)- 


(ii-a) Approximation of integral 1\. By a Taylor-expansion, for any /3 € R* 7 there is a 
t € [0,1] such that 


log L(0) = log L0j) -\{P~ Pj) T Hj (4j + t{0 
By (A3) and using that \t\ < 1, 


iP-M- 


Hj (Pj+t{0 - fijf) - Hj0j) < n • C cha nge \\P ft J 11 2 - 


Hence, 


1 a ,a a \ . 11 


log L(P) = log L{$j) - ^(/3 — $jY’ Hj(j3j)(P - $j) ± ^\\P~Pj\\l B C c hange ■ (3.3) 


Next, observe that (A2) implies that 

Hj0j)~ 112 A ^ 

We deduce that for any vector 0 € Af, 


1 . 7 . 


^Clower 


WP-hh < \/5|J|log(n)||H J (/3 J )- 1 / 2 || S p < J 5|J|1 ° SH . 

V ^Cjower 


(3.4) 


This gives 

logics) = log L(Pj) - \{p - M t Hj0j)(P -Pj)±)J |J|31 ° g3(n) • ^/ 12 4 5 c | hange . (3.5) 

Choosing the constant c samp i e large enough, we can ensure that the upper bound in (3.4) 
is no larger than 1. In other words, \\/3 — Pj \\2 < 1 for all points 0 £ Af. By our assumption 
that H/ 3 , 71|2 < amlej the set A f is thus contained in the ball 

H = {/3 € K' 7 : ||/3|| 2 < omle + 1} • 

Since, by (Bl), the logarithm of the prior density is Ti-Lipschitz on B 1 it follows form (3.4) 
that 

log fj(P) = \ogfj{0j)±F 1 \\0-0j\\2 = log fj0j) ± F x 


' 5| J| log(n) 


^Q ower 


(3.6) 
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Plugging (3.5) and (3.6) into Xi, and writing a = b - exp{±c} to denote a £ [b ■ e c , b ■ e c ], 
we find that 


Xi = L0j)fj(j3j) exp < ± | 


1 


|J| 3 log 3 (n) 


/ 5 Fl 

'Slower 


+ ' 


' I25c c 2 hang e 

4c fower 


(3.7) 


) M 


x / exp l — -{P — Pj) 1 Hj(/3j)(/3 — Pj) > d/3. 


In the last integral, change variables to £ = Hj(pj) 1 ^' 2 (p — pj) to see that 




exp — \{P~ Pj ) T - Pj) \ dp 


(det Hj(Pj)^ 

( (2tt)I j I \ 


- 1/2 

1/2 


det Hj(pj) 

- (27r)l J l \ 1/2 

det Hj(f3j) 


IICI|2<-\/ 5 l' 7 l lo g( n ) 

J -Pr{xfj| < 5|J|log(n)| 

I • exp{±l/ y/n}, 


ex P i -^11^112 


(3.8) 


where we use a tail bound for the x 2 ~distribution stated in Lemma A.l. We now substi¬ 
tute (3.8) into (3.7), and simplify the result using that e~ x >l — 2x and e x < 1 + 2x for all 
0 < x < 1. We find that 


1/2 


1 ± 2 1 


125c change . 1 / | J| 3 log 3 (n) 


4 r 3 

°lower 


Slower 


Tl 


(3.9) 

when the constant c samp i e is chosen large enough. 

(ii-b) Approximation of integral I 2 . Let /? be a point on the boundary of J\f. It then holds 
that 


(P - Pj) t HjCPj)(P - pj) = \/5|J|l0g(n) • II HjCPj) 1/2 (P - pj) I| 2 . 

We may deduce from (3.5) that 

log m < io g i ( /3j) - 

< log L(pj) - II HjCPj) 1,2 (P - pj)h ■ Virgin), 

for | J| 3 log 3 (n)/n sufficiently small, which can be ensured by choosing c samp ie large enough. 
The concavity of the log-likelihood function now implies that for all p qL Af we have 

log LOS) < log L0j) - \\HjCpj) 1,2 {p-pj)h ■ Virgin). (3.10) 

Moreover, using first assumption (B2) and then assumption (Bl), we have that 
log/j(/3) < log/j(0) + F 2 < log fj0j) + F 1 \\Pj\\ 2 + F 2 . 
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/R J \Af 


< 


(det Hj0j) 
< (det Hj 


Since \\Pj \\2 < cimle, it thus holds that 

log fj(P) < log fj0j) + FiOmle + F 2 . (3.11) 

Combining the bounds from (3.10) and (3.11), the integral can be bounded as 

X 2 < L0j)fj0j)e F ^ M ■ f exp {-II Hj0j) 1/2 {P - Pj)\\ 2 • 7|./| log(ra)} dp. 

(3-12) 

Changing variables to £ = -ffj(/7) 1/2 (/3 — fij) and applying Lemma A.2, we may bound the 
integral in (3.12) as 

exp { —||iij(/3j) 1/2 (/3 - Pj)\\ 2 ■ V\J\log(n)} d/3 

[ / _exp{ —\/|J|log(n) • ||£|| 2 }d£ 

'II4I|2>V 5 I J l lc, g( n ) 

*-i/2 . 4 (tt)I j I / 2 05\J\ log(?r) |J| 1 p _V 5 |j|i OB(n ) 
r (5 71) V\J\log{n) 

( ( 2 tt ) I j ' ^ 1/2 __ 2 V ^_ /5 . , , A 1 "' 72 ' 1 1 

ydetilj(/3j)J r(4|J|) U 1 1 ° S(n 7 nVBUf 

Stirling’s lower bound on the Gamma function gives 

(7I/2) |J|/2-1 = (|^|/2) |J| / 2 < 1 ui/a 

r(i|j|) r(i|j| + i) - 

Using this inequality, and returning to (3.12), we see that 

xe E l 0 ML E +F 2 2 eA /5 / 5elog(n) \ 171/2-1 

V « J 

Based on this fact, we certainly have the very loose bound that 

1/2 


1 

,(v / 5-l/2)-|J| + l 




i r iflMLE+^2 


. (3.13) 


(3.14) 


for all sufficiently large n. 

(ii-c) Combining the bounds. From (3.9) and (3.14), we obtain that 

1/2 

x 


Evidence(J) = h +1 2 = L0j)fj0j) ( i ] 


1 ± | e FlOMLE+F2 + 2 + 
for sufficiently large n, as desired. 


^det Hj(Pj) 

1 125c c 2 hange 

/ C lower 


20F 2 \ /1 J| 3 log 3 (n) 


Mower 


■ 


n 


(3.15) 

□ 
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Remark 3.1. The proof of Theorem 1 could be modified to handle other situations of interest. 
For instance, instead of a fixed Lipschitz constant F\ for all log prior densities, one could 
consider the case where log fj is Lipschitz with respect to a constant F\(J) that grows 
with the cardinality of |J|, e.g., at a rate of \/\J\ in which case the rate of square root of 
| J| 3 log 3 (n)/n could be modified to square root of | J| 4 log 3 (n)/n. The term e Fl( - J ' >aMIjE that 
would appear in (3.15) could be compensated using (3.13) in less crude of a way than when 
moving to (3.14). 


4. Numerical experiment for sparse Bayesian logistic regression 


In this section, we perform a simulation study to assess the approximation error in Laplace 
approximations to the marginal likelihood of logistic regression models. To this end, we 
generate independent covariate vectors X- t ..... X n with i.i.d. 1V(0,1) entries. For each choice 
of a (small) value of q , we take the true parameter vector fio £ R p to have the first q 
entries equal to two and the rest of the entries equal zero. So, Jo = {l,...,q}. We then 
generate independent binary responses Y\,... ,Y n , with values in {0,1} and distributed as 
(Yi\Xi) ~ Bernoulli(pj(Xj)), where 


Pi{x) 


( ex P (x t /3q) \ / Pi ( x ) \ 

yl + exp (z T /3o)y V 1 ~Pi(x)J 


An 


based on the usual (and canonical) logit link function. 

We record that the logistic regression model with covariates indexed by J C [p\ has the 
likelihood function 


HP) = ex P | Yi ' X uP - log (1 + exp (Xfjp)) |, P £ K J , (4.1) 

where, as previously defined, X,j = (Wj)jej denotes the subset of covariates for model J. 
The negative Hessian of the log-likelihood function is 


Hj(B) = Y. X ^ X " 


exp(X?j/3) 

(l + exp {XTjP)Y 


For Bayesian inference in the logistic regression model given by J, we consider as a prior 
distribution a standard normal distribution on , that is, the distribution of a random vector 
with | J| independent N( 0,1) coordinates. As in previous section, we denote the resulting prior 
density by fj. We then wish to approximate the evidence or marginal likelihood 


Evidence (J) := [ L((3)fj(p) d/3. 

Js. j 


As a first approximation, we use a Monte Carlo approach in which we simply draw inde¬ 
pendent samples [3 1 ,..., /3 B from the prior fj and estimate the evidence as 

1 B 

MonteCarlo( J) = — ^ T(/3 b ), 

° 6=1 
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Number of Samples, n 


Fig 1. Maximum Laplace approximation error, averaged over 100 data sets, as a function of the sample 
size n. The number of covariates is n/2, and the number of active covariates is bounded by q E {1,2,3}. 


where we use B = 50,000 in all of our simulations. As a second method, we compute the 
Laplace approximation 

Laplace(J) := L0j)fj0j) 

where fij is the maximum likelihood estimator in model J. For each choice of the number of 
covariates p , the model size g, and the sample size n, we calculate the Laplace approximation 
error as 

max | log MonteCarlo( J) — log Laplace( J) | . 

^C[p], | J\<q 

We consider n £ {50,60, 70,80,90,100} in our experiment. Since we wish to compute the 
Laplace approximation error of every g-sparse model, and the number of possible models is 
on the order of p q , we consider p = n/2 and q £ {1, 2, 3}. The Laplace approximation error, 
averaged across 100 independent simulations, is shown in Figure 1. We remark that the error 
in the Monte Carlo approximation to the marginal likelihood is negligible compared to the 
quantity plotted in Figure 1. With two independent runs of our Monte Carlo integration 
routine, we found the Monte Carlo error to be on the order of 0.05. 

For each q = 1,2,3, Figure 1 shows a decrease in Laplace approximation error as n 
increases. We emphasize that p and thus also the number of considered g-sparse models 
increase with n. As we increase the number of active covariates g, the Laplace approximation 
error increases. These facts are in agreement with Theorem 1. This said, the scope of this 
simulation experiment is clearly limited by the fact that only small values of q and moderate 
values of p and n are computationally feasible. 

5. Consistency of Bayesian variable selection 

In this section, we apply the result on uniform accuracy of the Laplace approximation (The¬ 
orem 1) to prove a high-dimensional consistency result for Bayesian variable selection. Here, 


' y /2 

det Hj(pj) I 
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consistency refers to the property that the probability of choosing the most parsimonious 
true model tends to one in a large-sample limit. As discussed in the Introduction, we consider 
priors of the form 

P,(J) « (*) 7 -l{|J|<g}, Jc[ P ], (5.1) 


where 7 > 0 is a parameter that allows one to interpolate between the case of a uniform 
distribution on models (7 = 0) and a prior for which the model cardinality | J| is uniformly 
distributed (7 = 1 ). 

Bayesian variable selection is based on maximizing the (unnormalized) posterior proba¬ 
bility 


Bayes 7 (J) := 



-7 

Evidence (J) 


(5.2) 


over J C [p], | J\ < q. Approximate Bayesian variable section can be based on maximizing 
instead the quantity 


Laplace 7 (J) 



-7 

Laplace (J). 


(5.3) 


We will identify asymptotic scenarios under which maximization of Laplace 7 yields consistent 
variable selection. Using Theorem 1, we obtain as a corollary that fully Bayesian variable 
selection, i.e., maximization of Bayes 7 , is consistent as well. 

To study consistency, we consider a sequence of variable selection problems indexed by the 
sample size n , where the n-th problem has p n covariates, true parameter /3o(n) with support 
Jo{n), and signal strength 

Pmin{n) = min \(/3 0 (n))j\. 

jeJo(n) 


In addition, let q n be the upper bound on the size of the considered models. The following 
consistency result is similar to the related results for extensions of the Bayesian information 
criterion (see, for instance, Chen and Chen, 2012; Barber and Drton, 2015). 

Theorem 2. Suppose that p n = n K for n > 0, that q n = ri^ for 0 < if < 1/3, that 
/3min(?i) = n~^ 2 for 0 < <f> < 1 — ip, and that k > if. Assume that (Al) holds for a 
fixed constant ao and that there a fixed functions ci OW er and c upper with respect to which the 
covariates satisfy the Hessian conditions (A2) and (A3) for all J D Jo(n) with |J| < 2 q n . 
Moreover, assume that for the considered family of prior densities {fj(-) ■ J C [p„], | J| < q n } 
there are constants F 3 ,F^ £ (0,oo) such that, uniformly for all |J| < q n , we have 


sup fj(P) <F 3 <o o, inf fj{P) > F 4 > 0, 

/3 ||p||2<aMLE 


where <Zmle is the constant from Theorem l(i). Then, for any 7 > 1— > model selection 

with Laplace 7 is consistent in the sense that the event 

J 0 (n) = argmax{Laplace 7 (J) : J C [p n ], |J| < q n } 
has probability tending to one as n —>• 00 . 

Together with Theorem 1, the proof of Theorem 2, which we give below, also shows 
consistency of the fully Bayesian procedure. 
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Corollary 5.1. Under the assumptions of Theorem 2, fully Bayesian model selection is 
consistent , that is, the event 

J 0 {n) = argmax{Bayes 7 (J) : J C [p n ], \ J\ < q n } 

has probability tending to one as n —» oo. 

Proof of Theorem 2. Our scaling assumptions for p ni q n and /3 m in (n) are such that the con¬ 
ditions imposed in Theorem 2.2 of Barber and Drton (2015) are met for n large enough. This 
theorem and Theorem l(i) in this paper then yield that there are constants v, e, Cf a i se , omle > 
0 such that with probability at least 1— pf the following three statements hold simultaneously: 

(a) For all | J| < q n with J D Join), 

l°gL(/3j) — logL(/3j o( „)) < (1 + e)(|J\J 0 (n)| + v)log(p n ) . (5.4) 

(b) For all | J| < q n with J Join), 

logL(/3 Jo (n)) - logL(/3j) > C'f a i se n/3 min (n) 2 . (5.5) 

(c) For all | J| < q n and some constant omle > 0, 

||/3j|| < omle- (5.6) 

In the remainder of this proof we show that these three facts, in combination with further 
technical results from Barber and Drton (2015), imply that 

Join) = argmax{Laplace 7 (J) : J C [p n ], |J| < q„}. (5.7) 

For simpler notation, we no longer indicate explicitly that p n , q n , (3 q and derived quantities 
vary with n. We will then show that 


log La p lacc 7 ^j^ = (!°g-P(Jo) - log P(J)) + (logL{j3 Jo ) - logL0jfj - iTVollog^ 

+ (log/j 0 0Jo) - log /j(/ 3 j)) + \ (logdet Hjifrj) - logdetPj 0 (/3j 0 )) (5.8) 
is positive for any model given by a set J ^ Jo of cardinality \J\ < q. We let 


Slower •— Slower(®ML e)? 


r ((*MLe)' 


False models. If J 2 Jt i, that is, if the model is false, we observe that 


> -7 log 


P 

\Jo\ 


> -jq log p. 


logP(Jo) - logP(J) = - 7 log j + 7 log 
Moreover, by (A2) and (5.6), 

log det Hj{(3j) - logdet H Jo 0 Jo ) > |J| • log(nci ower ) - | J 0 | • log(nc upper ) 


> —q log ( n —7 


Cupper 


min{ciower, 1}/ 
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Combining the lower bounds with (5.5), we obtain that 


106 T^Spy S «**"*■■ - l-'M’l - « 1< « (»’ Tn mmfaZ,.l> ) + ‘ 0g (S) 


> Cfal se n/3min - Q log 


Qipper 


• Mimp 1 ^ + log ■ 


v min{q ower , 1} 

By our scaling assumptions, the lower bound is positive for sufficiently large n. 


True models. It remains to resolve the case of J 3 Jo, that is, when model J is true. 
We record that from the proof of Theorem 2.2 in Barber and Drton (2015), it holds on the 
considered event of probability at least 1 — p~ v that for any J D J 0 , 


II — /5oII 2 < 


V ^upper 
^c lower T|AJo1, 


(5.9) 


where 

(J 0 + r ) log + log(4 ip v ) + r log( 2 p) . 

Under our scaling assumptions onp and q , it follows that ||/3j — /3oII 2 tends to zero as n —> 00 . 
We begin again by considering the prior on models, for which we have that 

logP(J 0 ) - logP(J) = 7 log _ j^jj' > - 7 |J\J 0 |logg + 7 |J\J 0 |log(p-q) 

> - 7 |J\J 0 |logg + 7 |J\J 0 |(l - e)logp 

for all n sufficiently large. Indeed, we assume that p = n K and q = ri^ with k> ij) such that 
p — q > p 1_e for any small constant e > 0 as long as p is sufficiently large relative to q. 

Next, if J D J 0 , then (A2) and (A3) allow us to relate Hj0j) and H Jo (/3j 0 ) to the 
respective Hessian at the true parameter, i.e., Hj(f3 0 ) and H Jo (/3 0 ). We find that 


r 2 = 


(1 — e') 3 


log 


det Hj0j) 


det Hj((3 0 ) \ 


AAHjJtT)) 2 l0g+ |J| los v 1 ~ 


^change 

Qower 


Wj-MV 


-|J 0 |log (1 + ^1151|/3 Jo -AII 2 ) • 

Note that by assuming n large enough, we may assume that \\j3j — /3oII 2 and ||/3j 0 — /3 0 1| 2 
are small enough for the logarithms to be well defined; recall (5.9). Using that x > log(l + 
x) for all x > — 1 and log(l — |) > —x for all 0 < x < 1, we see that 

log > log ( mZ A %\ ) - - Alb 

\ det Hj n (3j a ) J \ det Hj q (/3q) J Qower 


| Jq | ————1—1|/3j 0 /5 0 112 - 

Qower 


Under our scaling assumptions, q 3 log(p) = o(n), and thus applying (5.9) twice shows that 


-2|J|- 


Slower 


: ll Pj — M2 - | Jol 


Qower 


-\\Pjo - M 2 
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is larger than any small negative constant for n large enough. For simplicity, we take the 
lower bound as —1. By (A2), it holds that 

l°g ^deti?/ (^o')) = log< ^ et - Hj 0t j\j 0 ((3 0 ) T H Jo ((3 0 ) 1 Hj 0 ,j\j 0 Wo)) 

> | J\ J 0 | l0g(n) + | J\J 0 | log(ciower), 


because the eigenvalues of the Schur complement of Hj(/3q) are bounded the same way as 
the eigenvalues of Hj(f3 0 ); see, e.g., Chapter 2 of Zhang (2005). Hence, for sufficiently large 
n, the following is true for all J D J 0 : 

logdet Hj(Pj) — logdet Hj 0 (/3j 0 ) > | J\ Jo\ log(n) + | J\J 0 \ log(ci OW er) - 1- (5.10) 

Combining the bound for the model prior probabilities with (5.4) and (5.10), we have for 
any true model J D J 0 that 

Laplace 7 ( Jq) 


log 


Laplace (J) 


1 , 


> -(1 + e)(|J\J 0 | + v)log(p) + 7 |J\J 0 |(1 - e)log(p) + -1 J\ J 0 | log(n) 

f 3 


> 


- 7l J\M log(g) + \\ J\Jo\ (log + log 
,1 A J o\ (log (n) - logg 27 + 2 [(1 - e )7 - (1 + e)(l + v)\ log(p) + log ( < ^ 1 )) 


+ log (S)- L 

This lower bound is is positive for all n large because our assumption that p = n K , q = ri^ 
for 0 < ip < 1/3, and 

1 - 2V’ 

"7 > 1- 

7 2 (K-VO 

implies that 

\/n 


p(l+e)(l+C-7(l-?)g7 


= OO 


provided the constants e, z/, and e are chosen sufficiently small. 


(5.11) 

□ 


6. Discussion 

In this paper, we have shown that in the context of high-dimensional variable selection 
problems, the Laplace approximation can be accurate uniformly across a potentially very 
large number of sparse models. We then showed how this approximation result allows one to 
give results on the consistency of fully Bayesian techniques for variable selection. 

In practice, it is of course infeasible to evaluate the evidence or Laplace approximation for 
every single sparse regression model, and some search strategy must be adopted instead. Some 
related numerical experiments can be found in Chen and Chen (2008), Chen and Chen (2012), 
Zak-Szatkowska and Bogdan (2011), and Barber and Drton (2015), although that work 
considers BIC scores that drop some of the terms appearing in the Laplace approximation. 

Finally, we emphasize that the setup we considered concerns generalized linear models 
without dispersion parameter and with canonical link. The conditions from Luo and Chen 
(2013) could likely be used to extend our results to other situations. 
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Appendix A: Technical lemmas 

This section provides two lemmas that were used in the proof of Theorem 1. 
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Lemma A.l (Chi-square tail bound). Let xt denote a chi-square random variable with k 
degrees of freedom. Then, for any n > 3, 

P{xt < 5/clog(ro)} > 1 - ^7 > exp{-l/Vn}. 

Proof. Since log(n) > 1 when n > 3, we have that 

k + 2\Jk ■ k log(n) + 2k log(ra) < 5fclog(n). 

Using the chi-square tail bound in Laurent and Massart (2000), it thus holds that 

P {xl < 5fclog(n)} > P {xi < k + 2 y/k- k log(n) + 2 fclog(n)| 

> l _ g-fclog(n)_ 

Finally, for the last step, by the Taylor series for x H>■ e x , for all n > 3 we have 

r , r-, 111 1 

exp{-l /Vn} <1- -j= + - • - < 1-■ 

i/n 2 n n 

□ 


Lemma A.2. Let k > 1 be any integer, and let a, b > 0 be such that ab > 2 (k — 1). Then 


4(t r ) fe / 2 , 


,fc-i 


/ exp{— 6 ||^|| 2 }rfC < F / lzA , 

where the integral is taken over £ (E . 

Proof. We claim that the integral of interest is 

/ exp{-6||e|| 2 R=|^y f" < 


„—ab 


r k ~ 1 e~ r dr. 


(A.l) 


Indeed, in fc = 1 dimension, 


/ Il?ll2>a 


exp{-6||^|| 2 }dC = 2 


OO c\ 

e~ br dr = -e ~ ab , 
b 


t r—a 


which is what (A.l) evaluates to. If k > 2, then using polar coordinates (Anderson, 2003, 
Exercises 7.1-7.3), we find that 


k—2 


/ Il5ll2>a 


exp{- 6 ||£|| 2 }d£ = 2 tt 


" k ~ 1 e~ br dr ■ ]J 


r n/2 


= 27r 


" k ~ 1 e~ br dr ■ ]J 


cos l (6i)ddi 

\/^ r (|(* + 1 )) 


i=i J -*/ 2 
k—2 


t\ r (K* + 2 )) ’ 


which again agrees with the formula from (A.l). 

Now, the integral on the right-hand side of (A.l) defines the upper incomplete Gamma 
function and can be bounded as 

/» OO 

T{k,ab) = / r k ~ 1 e~ r dr < 2e~ ab {ab) k ~ 1 

J r=ab 

for ab > 2 (k — 1); see inequality (3.2) in Natalini and Palumbo (2000). This gives the bound 
that was to be proven. □ 







